The only thing you really need to do to prepare for R training is to install the latest versions of R and RStudio. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed. Directions for installing R/RStudio are below. If you run into any problems, check the instructions in R for Data Science Sections 1.4.1 and 1.4.2.
NOTE: If you don’t have administrative privileges on your computer, you will have to submit an IT HelpDesk ticket (you need to be on the VPN or NPS network to access this link) to install R and RStudio, which could take a while. PLEASE PLAN AHEAD!
Even if you already have R or RStudio installed, please install the latest versions of both programs. R recently went through a major version change from 3.x to 4.x with some potential code-breaking changes. The latest versions are needed to ensure everyone’s code behaves the same way.
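If you’re not sure which version of R you currently have, you can check it right from the R console:

```r
# Print the version of R you're currently running (should start with 4.x)
R.version.string
```

If this prints a 3.x version, install the latest R before training.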
If you are attending Day 4: R packages and version control, you will need to install Git for Windows, RTools, and devtools and roxygen2 packages.
Download the latest 64-bit Git for Windows by clicking on the “Click here to download” link at the top and installing the file. Once installed, RStudio can typically find it on its own.
Download files and instructions for installing Rtools are available on CRAN’s Rtools page. It’s a large file and may require admin privileges, so be sure to install it prior to training. You must also be using R 4.0 or higher for this training, so be sure to download Rtools4.
After you install Rtools, you’ll want to install the devtools package. The devtools package allows you to install packages from GitHub, and it will be the easiest way for others to install your packages in their R environment. Run the code chunk below to make sure everything is installed and running correctly. The devtools package has a lot of dependencies, and you often have to install new packages or update existing packages during the process. If you’re asked to update libraries while trying to install devtools, you should update them. The most common reason a devtools install fails is an outdated version of one of its dependencies.
install.packages('devtools')
library(devtools)
The roxygen2 package is a dependency of devtools, and it should be installed if you successfully installed devtools. However, it’s always good to check that it installed properly. The roxygen2 package helps with package documentation. The usethis package is relatively new, and some features that used to live in devtools now live in usethis.
library(roxygen2)
library(usethis)
Once these packages load without errors, you’re all set. If you have any issues making this work, contact Kate Miller or Sarah Wright prior to training, and we’ll help you troubleshoot.
The entire training will take place over 4 half days. Each day will run from 10-2 MST via MS Teams. At least one trainer will also hold office hours during the hour before and the hour after each session, in case there are questions that couldn’t be handled during training.
Each training session has three trainers assigned, two of whom will be the main instructors. The third trainer will manage the chat. For most of the training, one of the trainers will share their screen to walk through the website and then demo with live coding. It will help a lot if you have two monitors, so you can have the shared screen on one monitor and your own session of RStudio open on the other.
Half days are barely enough to scratch the surface of the more advanced topics we’re covering in R. Our goal with this training is to help you get beyond the initial learning curve, so you’re armed with the skills you need to continue learning on your own. The trainers put a lot of thought and time into designing this training. We did it because we enjoy coding in R, and it has greatly increased the efficiency and productivity in our jobs. We hope you have a similar experience as you begin applying and learning more about the tools we’re covering this week. As you learn more, please share your skills and code with others to help us build a larger community of R users in IMD who can learn from each other.
Finally, to help us improve this training for future sessions, please leave feedback in the training feedback form.
You can download USFS FIA data in R using their DataStore URL. The code below downloads multiple states’ PLOT tables, prints the number of rows in each (which you’ll want to check against the DataStore records), and saves them to a data folder on my hard drive.
# Create function to iterate download
downFIA <- function(state, table){
  csv_name <- paste(state, table, sep = "_")
  csv_link <- paste0('http://apps.fs.usda.gov/fia/datamart/CSV/', csv_name, '.csv')
  dat <- read.csv(csv_link)
  print(paste0(csv_name, ": ", nrow(dat), " rows")) # check against DataStore records
  write.csv(dat, paste0("./data/", csv_name, '.csv'), row.names = FALSE)
}
# Create list of states of interest
states <- c("CT", "RI", "DE")
# Use purrr::map to download each state's PLOT table.
purrr::map(states, ~downFIA(., "PLOT"))
You can take this a step further with purrr::map2() to download from a list of states and a list of tables. But note that these are big files, and the process may crash if you try to do too much in one call.
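As a sketch of that idea, base R’s expand.grid() can build every state x table combination, and purrr::map2() then iterates over both columns in parallel. The downFIA() call is commented out here so this sketch doesn’t actually download anything.

```r
# Hypothetical download list: every state x table combination
states <- c("CT", "RI", "DE")
tables <- c("PLOT", "COND")
combos <- expand.grid(state = states, table = tables,
                      stringsAsFactors = FALSE)
nrow(combos) # 6 state x table pairs

# purrr::map2() walks both columns at once, calling downFIA() per pair
# (commented out because it downloads large files):
# purrr::map2(combos$state, combos$table, downFIA)
```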
rFIA
The rFIA package was recently developed through a cooperative agreement between NETN and Hunter Stanke, a grad student at Michigan State University. The package was designed to make it easier to access and import FIA data into R and to perform common queries and analyses for specified areas and time series with the FIA data.
FIA datasets are huge and can take a while to download. For our purposes here, we’re going to download DE and RI’s data, since they’re the smallest.
#install.packages("rFIA")
library(rFIA)
# Download DE and RI state data tables to data folder
getFIA(states = c('DE', 'RI'), dir = "./data")
# Download reference tables (ie lookup tables)
getFIA(states = 'ref', tables = c('SPECIES', 'PLANT_DICTIONARY'))
The code below will import database objects into your global environment. To see the names of the tables within the database object, you can just write names(fia_all). To view a specific table, you can write fia_all$PLOT.
# Import all states' data stored in the data folder
fia_all <- readFIA("./data")
names(fia_all) # Print table names
table(fia_all$PLOT$STATECD) # View number of plots by state code. Should show 2 states.
table(fia_all$PLOT$INVYR) # Range of inventory years in the data
# Import RI data only
fia_ri <- readFIA("./data", states = 'RI')
names(fia_ri)
head(fia_ri$PLOT)
table(fia_ri$PLOT$STATECD) # View number of plots by state code. Now only 1.
table(fia_ri$PLOT$INVYR)
Another really useful feature of the rFIA package is that you can clip the data to a time period or county. The code below clips the RI FIA data to the most recent survey of each plot in the sample design.
# Clip all data to most recent sample
all_recent <- clipFIA(fia_all, mostRecent = TRUE)
# Print list of counties in RI
countiesRI
# Clip RI to Washington County
ri_Wash <- clipFIA(fia_ri, mask = countiesRI[5,], mostRecent = FALSE)
You can also calculate common summary statistics and sampling errors using other functions within rFIA, including carbon, biomass, diversity, down woody material, seedling density, tree population estimates, and tree growth and mortality. I’ll demonstrate this for tree population estimates using the tpa() function. These functions allow you to summarize or group at multiple levels, including the state, plot, subplot, and tree level. You can also group by species and size class using this function. Grouping by ownership is also possible, which allows federal lands to be summarized separately from private lands. The grpBy argument also allows you to specify multiple grouping variables.
fia_ri_rec <- clipFIA(fia_ri, mostRecent = TRUE)
# State level tree population estimates
tpa_ri <- tpa(fia_ri)
# Plot level tree population estimates for most recent survey
tpa_ri_plot <- tpa(fia_ri_rec, byPlot = TRUE)
# Species level and size class level tree population estimates
tpa_ri_sppsz <- tpa(fia_ri_rec, bySpecies = TRUE, bySizeClass = TRUE)
# by Ownership
tpa_ri_own <- tpa(fia_ri_rec, grpBy = OWNGRPCD)
The package also has some built-in plotting functions that help you visualize the data created by the previous step. The plot below is the trees per acre for the state of RI over time.
head(tpa_ri)
## # A tibble: 6 x 8
## YEAR TPA BAA TPA_SE BAA_SE nPlots_TREE nPlots_AREA N
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 2005 516. 112. 10.6 3.89 66 67 95
## 2 2006 496. 111. 8.85 3.23 93 95 140
## 3 2007 501. 112. 6.89 3.29 125 126 198
## 4 2008 499. 115. 6.86 3.36 121 122 193
## 5 2009 516. 116. 6.86 3.21 116 118 191
## 6 2010 504. 118. 6.95 3.19 118 119 196
plotFIA(tpa_ri, y = TPA, plot.title = "Trees per acre")
If you tried to do GIS and make maps in R even a few years ago, you probably encountered the same frustrations I did. There were a ton of packages out there, each with its own file format and coding conventions, and each package rarely had all the features I needed. It was not easy to navigate, and I often found myself just exporting my work out of R and doing the rest in ArcGIS… Enter the sf and tmap packages, the latest and greatest R packages devoted to GIS and map making! Most of the frustrations I had with earlier packages have been resolved by these two packages.
The sf package is the workhorse for anything you need to do with spatial vector data. Spatial objects in sf are called simple features, which follow a set of GIS standards that are becoming the universal data model for vector-based spatial data in R and that most GIS-based packages in R now employ. Simple features are also now the standard for open source GIS in general. That means if you’re trying to troubleshoot something with simple features, you’ll need to specify that it’s for R, rather than PostGIS or some other implementation. The sf package is also superseding the sp and rgdal packages, which used to provide the most common data model in R and open source GIS. The more I use this package, the more impressed I am with how intuitive it is to use and how well documented the functions are. For vector data, I have yet to find a task that sf couldn’t do.
The main application of the tmap package is making maps, and it was designed with a grammar of graphics philosophy similar to ggplot2. In practice, that means tmap maps are built as a collection of many small functions/layers that get added together, and order matters. There are also tmap-enabled functions you can use in ggplot2 plots, but you can do a lot more in tmap. I also prefer the look of tmap’s built-in compass, legends, etc. over the ones available in ggspatial, which is an add-on package for plotting spatial data in ggplot2.
The raster package is also excellent for analyzing and processing raster data. For large jobs, I find it easier to work with than ESRI tools, and it tends to run a lot faster than ESRI’s built-in tools (I haven’t compared it with Python).
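As a tiny, self-contained sketch of the raster workflow (a hypothetical 10 x 10 raster filled with random values, just to show the basic functions):

```r
library(raster)

# Hypothetical 10 x 10 raster of random values in a unit square extent
r <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(r) <- runif(ncell(r))

cellStats(r, mean) # summary statistic across all cells
plot(r)            # quick-look plot of the raster
```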
Finally, the leaflet package in R allows you to create interactive maps as HTML widgets. These are often included in R shiny apps, but they can also be used in R Markdown with HTML output (for more on that, attend next Wednesday’s session on R Markdown). An internet connection is required for the maps to be interactive. Without an internet connection, the map will only show the default extent defined in the code.
The leaflet package is relatively easy to use for basic mapping. For more advanced features or further customization, you often end up having to code in JavaScript, the language the underlying Leaflet library was originally written in. There’s also a lot more online help available for the JavaScript version of the Leaflet library than for the R package, so if you’re really stumped about something, you may find your answer in the JavaScript help.
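A minimal sketch of basic leaflet use (the coordinates below are approximate, just for illustration):

```r
library(leaflet)

# Minimal interactive map: default OpenStreetMap tiles plus one marker.
# Coordinates are approximate (roughly Saratoga NHP).
map <- leaflet() %>%
  addTiles() %>%                                  # default basemap
  setView(lng = -73.64, lat = 43.00, zoom = 12) %>%
  addMarkers(lng = -73.64, lat = 43.00, popup = "Saratoga NHP")

map # print to render in the RStudio Viewer
```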
My two favorite features of sf are 1) the attribute table of a simple feature (sf’s equivalent of a shapefile) is a dataframe in R, and 2) sf functions are pipeable with tidyverse functions. That means if you want to delete points in your layer, you can use dplyr’s filter() function to filter them out, and sf will update the layer’s geometry to remove the points that were dropped.
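A minimal sketch of that workflow, using a hypothetical three-point layer built from scratch:

```r
library(sf)
library(dplyr)

# Hypothetical point layer built from a plain dataframe
pts_df <- data.frame(id = 1:3,
                     habitat = c("Forest", "Field", "Forest"),
                     x = c(0, 1, 2), y = c(0, 1, 2))
pts_sf <- st_as_sf(pts_df, coords = c("x", "y"), crs = 26918)

# dplyr::filter() drops the rows AND their geometry in one step
forest_pts <- pts_sf %>% filter(habitat == "Forest")
nrow(forest_pts) # 2 features remain
```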
To demonstrate the use of sf and tmap, I’m going to generate a simple GRTS sample using spsurvey, which now connects to sf instead of sp and rgdal. Then we’ll filter out points that don’t fall in a forest habitat to demonstrate how to work with sf data in R. Finally we’ll plot the results using tmap.
I wasn’t able to figure out how to post a shapefile that you could easily download in R via a URL, so I emailed the files to you as data.zip last week. I also posted all of the files to our Scientists Team, which can be downloaded using the following link: https://doimspp.sharepoint.com/:f:/s/ScientistsTraining2022/EiXOTYV1l4RCk1sMr5yXhZUB1ZFEaAlAN4elvsYbBfYHRg?e=ktVy5n. To follow along, you’ll need to download (and unzip, if you’re using the email attachment) the shapefiles and save them to your working directory.
Once those steps are completed, we’re ready to generate a GRTS sample and start making a map. Note that I’m using 6-digit hex colors (e.g., “#ffffff” is white) to define the fill color for each habitat type. To find your own colors or see what these look like, check out htmlcolorcodes.com.
#install.packages(c("sf", "spsurvey"))
library(dplyr) # for filter, select, mutate and %>%
library(sf)
library(tmap)
library(spsurvey)
# Generate a random number for the seed
sample(1:100000, 1) #62051
set.seed(62051)
# Read in shapefiles from teams data folder
sara_bound1 <- st_read("./shapefiles/SARA_boundary_4269.shp")
sara_veg1 <- st_read("./shapefiles/SARA_Veg.shp")
# Check that the projections match; fix the one that doesn't match
st_crs(sara_bound1) == st_crs(sara_veg1) # FALSE- projections don't match.
# sara_bound1 needs to be reprojected to UTM Zone 18N NAD83.
sara_bound <- st_transform(sara_bound1, crs = 26918)
st_crs(sara_bound) == st_crs(sara_veg1) # TRUE
# Quick plot
plot(sara_bound[1])
plot(sara_veg1[1]) # bigger extent than boundary
# Intersect boundary and veg so they cover the same extent
sara_veg <- st_intersection(sara_veg1, sara_bound)
plot(sara_veg[1])
# View attribute table of layers
head(sara_bound) # 1 feature with 95 fields
str(sara_veg)
head(sara_veg)
names(sara_veg)
table(sara_veg$ANDERSONII)
# Simplify vegetation types for easier plotting
dev <- c("1. Urban or Built-up Land", "11. Residential",
"12. Commercial and Services", "13. Industrial",
"14. Transportation, Communications, and Utilities",
"17. Other Urban or Built-up Land")
crop <- c("21. Cropland and Pasture",
"22. Orchards, Groves, Vineyards, and Nurseries",
"31. Herbaceous Rangeland")
shrubland <- c("32. Shrub and Brush Rangeland")
forest_up <- c("41. Deciduous Forest Land", "42. Evergreen Forest Land",
"43. Mixed Forest Land")
forest_wet <- c("61. Forested Wetland")
open_wet <- c("62. Nonforested wetland", "62. Nonforested Wetland")
water <- c("5. Water", "51. Streams and Canals", "53. Reservoirs")
unk <- "Unclassified"
# Create 2 fields in the veg attribute table: simp_veg, and fills
sara_veg <- sara_veg %>%
mutate(simp_veg = case_when(ANDERSONII %in% dev ~ "Developed",
ANDERSONII %in% crop ~ "Open field",
ANDERSONII %in% shrubland ~ "Shrublands",
ANDERSONII %in% forest_up ~ "Forest",
ANDERSONII %in% forest_wet ~ "Forested wetland",
ANDERSONII %in% open_wet ~ "Open wetland",
ANDERSONII %in% water ~ "Open water",
ANDERSONII %in% unk ~ "Unclassified",
TRUE ~ "Unknown"),
fill_col = case_when(simp_veg == "Developed" ~ "#D8D8D8",
simp_veg == "Open field" ~ "#f5f0b0",
simp_veg == "Shrublands" ~ "#F29839",
simp_veg == "Powerline ROW" ~ "#F9421D",
simp_veg == "Forest" ~ "#55785e",
simp_veg == "Forested wetland" ~ "#9577a6",
simp_veg == "Open wetland" ~ "#c497d4",
simp_veg == "Open water" ~ "#AFD0F2",
simp_veg == "Unclassified" ~ "#ffffff"))
Before moving to the next step, I want to show how easy it is to create simple features from dataframes that contain X/Y coordinates. We’ll read in a fake dataset from the GitHub repo for this training and call it wetdat. The dataset contains fake species composition data for wetland monitoring sites in ACAD and includes X and Y coordinates. We’ll use dplyr functions to calculate the number of invasive and protected species for each site x year combination. Then we’ll make it a simple feature and save it as a shapefile. Note that there are multiple formats you can save simple features to. I show the shapefile version, in case you find yourself going between R and ArcGIS.
library(dplyr)
# Import data
wetdat <- read.csv(
"https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv")
# Summarize so that each site x year combination has 1 row
wetsum <- wetdat %>% group_by(Site_Name, Year, X_Coord, Y_Coord) %>%
summarize(num_inv = sum(Invasive), num_prot = sum(Protected),
.groups = 'drop')
# Check first 6 rows of output
head(wetsum)
# Convert dataframe to sf
wetsum_sf <- st_as_sf(wetsum, coords = c("X_Coord", "Y_Coord"), crs = 26919)
# ACAD is UTM Zone 19N NAD83, hence the difference from SARA, which is Zone 18N.
# Write sf to disk as shapefile
st_write(wetsum_sf, "ACAD_wetland_data.shp")
The spsurvey package has been updated to use sf instead of rgdal. It’s a code-breaking change if you have old R scripts that generated GRTS samples. However, the process is even easier now, as you can see below. Here we’re just going to use the simplest of GRTS designs. The output from grts() has multiple slots. The one we want is sites_base, and you can see how we extract that slot in the code below.
Note: While I followed the approach documented in the spsurvey vignette to generate reproducible GRTS samples, it does not appear to be working in my testing. Despite running set.seed(62051) after loading spsurvey and then running the code chunk below, each sample appears to be different.
sara_grts <- grts(sara_bound, n_base = 100) # generate 100 points within SARA boundary
sara_grts$sites_base$priority <- as.numeric(row.names(sara_grts$sites_base)) # add priority number (same as row.name)
The first step here is to spatially join the sara_grts layer to the sara_veg layer. We only care about sara_veg’s simp_veg field, so I used dplyr’s select() function to keep just that column. After joining, you should see a simp_veg field added to the end of the grts layer (it’s actually to the left of geometry, which is always the last column). In the next step, we filter the points in the newly created grts_veg layer to include only points that fall in forest habitat.
# Spatial join
grts_veg <- st_join(sara_grts$sites_base, sara_veg %>% select(simp_veg))
names(grts_veg)
# Create list of forest habitats
sort(unique(sara_veg$simp_veg))
forest_veg <- c("Forest", "Forested wetland") # must match simp_veg values exactly (lowercase "w")
# Filter out non-forest points
grts_forest <- grts_veg %>% filter(simp_veg %in% forest_veg)
nrow(grts_veg) # 100 points
nrow(grts_forest) # fewer points
Now it’s time to plot. The code below may seem like a lot, but we’ll break it down piece by piece. The great thing about building maps this way is that you’re essentially building a template in code that you can borrow from and adapt for future maps. You can even turn these into functions that are even easier to reuse for future maps. That’s beyond what we can cover here, but it’s another great benefit of making maps in R instead of ArcGIS.
The code below is broken into the various pieces that make up the map. The way tmap works is that you first add a layer via tm_shape(), and then specify how you want that layer to look, like tm_fill() or tm_borders(). Each piece has its own legend as well. This is similar to how you start each ggplot2 graph by defining the data with ggplot(data, aes(x = xvar, y = yvar)), and then add geom_point() or whatever else to define how the data are plotted. The difference with tmap is that every layer you want in the map has to be coded this way. Finally, tm_layout() is similar to theme() in ggplot2, and is where you can customize the map layout.
for_legend makes a list of the habitat types in simp_veg and their associated colors. That saves you from having to type them all again and potentially misspelling them.
tm_shape() in the map sets the projection and the bounding box. If you don’t set the bounding box, the map will show the largest extent of your layers. So if you have a roads layer at the state level, your map will be zoomed to that extent, instead of the park boundary.
tm_fill() fills polygons with color, while tm_borders() only plots their outlines. If you want both borders and fills, use tm_polygons() instead.
tm_add_legend() allows you to set the order that each legend appears in the map. Otherwise, they’ll appear in the order you specify the layers.
tm_symbols() allows you to change the symbol format in a similar way to ggplot2. We then added tm_text() to label each point using the numbers in the priority column of the grts_forest attribute table. The xmod and ymod arguments allow you to offset labels from the points horizontally or vertically. In this case, a negative xmod moves the label to the left, and a negative ymod moves the label below the point. The default location for labels is directly on top of the point.
tm_layout() is where you can change the default settings of the figure, including font size, placement of the legend, and margins. The title name and position are also set in tm_layout().
tmap_save() allows you to write the map to disk and to specify the height and width of the map. I prefer to write maps to disk to see what the size looks like before changing any of the layout and font size settings, because the figure on your disk will look different (and often better) than the plot view in your RStudio session.
# Creating list of simp_veg types and their fill colors for easier legend
for_legend <- unique(data.frame(simp_veg = sara_veg$simp_veg, fill_col = sara_veg$fill_col))
sara_map <-
# Vegetation map
tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
tm_fill("fill_col") +
tm_add_legend(type = 'fill', labels = for_legend$simp_veg,
col = for_legend$fill_col, z = 3) +
# Park boundary
tm_shape(sara_bound) +
tm_borders('black', lwd = 2) +
tm_add_legend(type = 'line', labels = "Park Boundary", col = "black",
z = 2)+
# GRTS points
tm_shape(grts_forest) +
tm_symbols(shape = 21, col = "#EAFF16", border.lwd = 0.5, size = 0.3) +
tm_text("priority", size = 0.9, xmod = -0.4, ymod = -0.4) +
tm_add_legend(type = 'symbol', labels = "GRTS points", shape = 21,
col = "#EAFF16", border.lwd = 1, size = 0.5,
z = 1) +
# Other map features
tm_compass(size = 2, type = 'arrow',
text.size = 1, position = c('left', 'bottom')) +
tm_scale_bar(text.size = 1.25, position = c("center", "bottom")) +
tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02), # make room for legend
outer.margins = 0,
legend.text.size = 1.25,
legend.just = 'right',
legend.position = c("right", "bottom"),
title = "Saratoga NHP GRTS points",
title.position = c("center", "top"))
tmap_save(sara_map, "SARA_GRTS.png", height = 10.5, width = 8,
units = 'in', dpi = 600, outer.margins = 0)
Here’s what sara_map looks like in your plot viewer:
sara_map
Here’s what the map looks like after it’s written to disk and the dimensions are set using tmap_save():
The last cool thing to show with tmap is that you can create interactive HTML widgets similar to what leaflet can do (coming next). With the interactive mode, you can change baselayers, turn your layers on/off, and zoom to different extents. The legend in interactive mode is limited to only showing fills, but it’s still pretty cool.
tmap_mode("view") # turn interactive mode on
sara_map